Exploration of Different Imputation Methods for Missing Data

Karthik Aerra, Liz Miller, Mohit Kumar Veeraboina, Robert Stairs

2024-07-28

Introduction

  • Missing data in real-world datasets is a key challenge.
  • Causes: Processing errors, measurement errors, survey non-responses, invalid calculations, participant dropout.
  • Impact: Reduces usable observations, introduces bias, interferes with analysis tools.
  • Importance: Accurate handling is critical for reliable data analysis.

Identifying Missing Data

  • Representation: Standard (NA, NaN, Null, None), strings (blanks, spaces, empty strings), numerical placeholders (extreme values like 999999).
  • Useful R functions: is.na() flags standard missing values (sum(is.na(x)) counts them); unique() reveals unexpected strings.
  • Visual Tools: Histograms for numerical placeholders.
  • Example: Survey dataset with missing age values represented as blanks.
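To make these identification steps concrete, here is a minimal base-R sketch on a hypothetical survey data frame (the column name and placeholder values are illustrative, not from the Ozone data):

```r
# Toy survey data: age stored as text, with a standard NA,
# a blank string, and a numeric placeholder for "missing"
survey <- data.frame(age = c("34", NA, "", "999999", "52"),
                     stringsAsFactors = FALSE)

unique(survey$age)      # reveals unexpected strings such as ""
sum(is.na(survey$age))  # counts only the standard NA (here, 1)

# Recode blanks and placeholders to NA, then convert to numeric
survey$age[survey$age %in% c("", "999999")] <- NA
survey$age <- as.numeric(survey$age)
sum(is.na(survey$age))  # now all 3 missing values are recognized
```

A histogram of the numeric column (hist()) would similarly expose extreme placeholder values before recoding.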

Missing Data Mechanisms

  • MCAR (Missing Completely at Random): Independent of observed and unobserved data. Example: Random network connectivity loss. Impact: Deletion reduces data size, no bias.
  • MAR (Missing at Random): Depends on observed data. Example: Age omission in surveys by specific demographics. Impact: Potential bias if not handled.
  • MNAR (Missing Not at Random): Related to unobserved data. Example: Health survey non-responses due to severity of illness. Impact: Complex analysis, prone to bias.

Techniques for Handling Missing Data

Deletion Methods:

  • Listwise Deletion: Removes rows with any missing values. Simple but reduces dataset size.
  • Feature Selection: Removes columns with high proportions of missing data. Maintains observation count but reduces feature count.
  • Example: Removing rows with missing income data in a financial dataset.
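In base R, both deletion strategies are one-liners; this sketch uses a small hypothetical data frame (the column names and the 25% threshold are illustrative):

```r
df <- data.frame(income = c(50000, NA, 62000, NA),
                 age    = c(31, 45, NA, 28),
                 region = c("N", "S", "E", "W"))

# Listwise deletion: drop every row containing any NA
complete_rows <- na.omit(df)          # keeps only row 1 here

# Feature selection: drop columns whose share of NAs exceeds a threshold
threshold <- 0.25
keep    <- colMeans(is.na(df)) <= threshold
reduced <- df[, keep]                 # drops income (50% missing)
```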

Single Imputation:

  • Mean/Median Imputation: Replaces missing values with the mean/median of the column.
  • Hot Deck Imputation: Uses values from similar records to fill missing data.
  • Regression-based Imputation: Predicts missing values using regression models based on other variables.
  • Example: Replacing missing temperature readings with the average temperature of the day.
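Mean and median imputation are straightforward in base R; a sketch on a hypothetical vector of temperature readings:

```r
# Hourly temperature readings with two missing values (illustrative)
temps <- c(61, 64, NA, 70, 68, NA, 66)

# Mean imputation: replace every NA with the observed mean
mean_imputed <- temps
mean_imputed[is.na(mean_imputed)] <- mean(temps, na.rm = TRUE)

# Median imputation is identical except for the summary statistic
median_imputed <- temps
median_imputed[is.na(median_imputed)] <- median(temps, na.rm = TRUE)
```

Note that filling every gap with the same constant shrinks the variance of the imputed column, one reason single imputation can understate uncertainty.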

Multiple Imputation:

  • MICE (Multiple Imputation by Chained Equations): Iteratively imputes each incomplete variable using the other variables, incorporating uncertainty in the imputed values.
  • Process: Generates multiple completed datasets, analyzes each, and pools the results for the final analysis.
  • Example: Using MICE to impute missing survey responses across several related questions.
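In R this workflow is typically run with the mice package; a rough sketch (the data frame and the variable names y, x1, x2 are placeholders, not the Ozone variables):

```r
library(mice)  # assumes the mice package is installed

# Generate m = 5 completed datasets; "pmm" = predictive mean matching
imp <- mice(data, m = 5, method = "pmm", seed = 123)

# Fit the analysis model on each completed dataset
fits <- with(imp, lm(y ~ x1 + x2))

# Pool the m sets of estimates using Rubin's rules
summary(pool(fits))
```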

Theories for Missing Data

Rubin’s Missing Data Theory:

  • Categories: MCAR, MAR, MNAR.
  • Framework for understanding mechanisms behind missing data.
  • Guides selection of imputation strategies based on these mechanisms.

Statistical Theory of Imputation:

  • Uses statistical models to estimate missing values.
  • Multiple imputation generates multiple plausible values for each missing data point.
  • Example: Using multiple imputation to handle missing financial data in economic research.

Machine Learning Theories:

  • Employ techniques like k-nearest neighbors (KNN) and neural networks.
  • Predict missing values based on patterns in observed data.
  • Example: Using KNN to fill in missing entries in a customer database.
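As a sketch of the KNN idea in base R (a toy single-value implementation, not the optimized routines in packages such as VIM; the customer data are invented):

```r
# Impute one missing value by averaging the k nearest complete rows,
# measuring distance on the columns observed for the target row
knn_impute_one <- function(df, row, col, k = 2) {
  complete <- df[complete.cases(df), ]
  obs_cols <- setdiff(names(df)[!is.na(df[row, ])], col)
  d <- apply(complete[, obs_cols, drop = FALSE], 1,
             function(x) sqrt(sum((x - unlist(df[row, obs_cols]))^2)))
  mean(complete[order(d)[seq_len(k)], col])
}

customers <- data.frame(age   = c(25, 30, 35, 28),
                        spend = c(200, 250, 300, NA))
knn_impute_one(customers, row = 4, col = "spend", k = 2)  # averages rows 2 and 1
```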

Research Methods for Missing Data

Descriptive Studies:

  • Analyze patterns of missing data.
  • Example: Jerez et al. analyzed missing data in medical records.

Comparative Studies:

  • Evaluate the performance of different imputation methods.
  • Example: Ibrahim and Zheng compared methods in clinical trial data.

Simulation Studies:

  • Test imputation methods on controlled datasets.
  • Allow systematic observation of the impact of different techniques.
  • Example: Using simulated datasets to evaluate the accuracy of imputation methods.

Application-Based Studies:

  • Real-world implementation examples.
  • Example: Saqib Ejaz Awan demonstrated imputation methods in large-scale survey data.

Overview of Methodology

Classification of Data

Randomness Testing

  • Little’s MCAR test
  • Hawkins’ test
  • Non-parametric test
  • Missing data pattern visualizations

Imputation Methods

  • Deletion:
    • List-wise
    • Feature selection

Simple:

  • Mean

  • Median

  • Mode

Complex:

  • MICE

  • missForest

Model-Fitting: Decision Tree

Model-Fitting: Random Forest

Model-Fitting: KNN

Analysis and Results

Introduction to the Dataset

  • “Ozone: Los Angeles Ozone Pollution Data, 1976” from mlbench package in R (“Ozone”)
  • It contains observations of pollution levels in the Los Angeles area during 1976.
  • It has 13 variables and 366 observations (one per day; 1976 was a leap year).
  • We chose this dataset because of its high volume of already-missing data.
  • It demonstrates a real-life scenario in which the missing values are truly unknown, so the effectiveness of imputation methods cannot be measured directly.
  • It is up to the investigator to choose a methodology for handling the missing data and appropriate metrics for evaluating its effectiveness.

Variables in the Ozone Dataset

  • V1: Month, 1-12, where 1 is January and 12 is December
  • V2: Day of month
  • V3: Day of week, 1-7, where 1 is Monday and 7 is Sunday
  • V4: Daily maximum one-hour-average ozone reading
  • V5: 500 millibar pressure height (m) measured at Vandenberg AFB
  • V6: Wind speed (mph) at Los Angeles International Airport (LAX)
  • V7: Humidity (%) at LAX
  • V8: Temperature (degrees F) measured at Sandburg, CA
  • V9: Temperature (degrees F) measured at El Monte, CA
  • V10: Inversion base height (feet) at LAX
  • V11: Pressure gradient (mm Hg) from LAX to Daggett, CA
  • V12: Inversion base temperature (degrees F) at LAX
  • V13: Visibility (miles) measured at LAX

Summarizing the Data

  • The summary() function shows the number of NAs for each column
  • Most columns have <20 NAs; however, V9 contains 139 missing values
  • V9: Temperature (degrees F) measured at El Monte, CA
       V4              V1            V2      V3           V5      
 Min.   : 1.00   1      : 31   1      : 12   1:52   Min.   :5320  
 1st Qu.: 5.00   3      : 31   2      : 12   2:52   1st Qu.:5700  
 Median : 9.00   5      : 31   3      : 12   3:52   Median :5770  
 Mean   :11.53   7      : 31   4      : 12   4:53   Mean   :5753  
 3rd Qu.:16.00   8      : 31   5      : 12   5:53   3rd Qu.:5830  
 Max.   :38.00   10     : 31   6      : 12   6:52   Max.   :5950  
 NA's   :5       (Other):180   (Other):294   7:52   NA's   :12    
       V6               V7              V8              V9       
 Min.   : 0.000   Min.   :19.00   Min.   :25.00   Min.   :27.68  
 1st Qu.: 3.000   1st Qu.:49.00   1st Qu.:51.00   1st Qu.:49.73  
 Median : 5.000   Median :65.00   Median :62.00   Median :57.02  
 Mean   : 4.869   Mean   :58.48   Mean   :61.91   Mean   :56.85  
 3rd Qu.: 6.000   3rd Qu.:73.00   3rd Qu.:72.00   3rd Qu.:66.11  
 Max.   :11.000   Max.   :93.00   Max.   :93.00   Max.   :82.58  
                  NA's   :15      NA's   :2       NA's   :139    
      V10            V11             V12             V13       
 Min.   : 111   Min.   :-69.0   Min.   :27.50   Min.   :  0.0  
 1st Qu.: 890   1st Qu.:-10.0   1st Qu.:51.26   1st Qu.: 70.0  
 Median :2125   Median : 24.0   Median :62.24   Median :110.0  
 Mean   :2591   Mean   : 17.8   Mean   :60.93   Mean   :123.3  
 3rd Qu.:5000   3rd Qu.: 45.0   3rd Qu.:70.52   3rd Qu.:150.0  
 Max.   :5000   Max.   :107.0   Max.   :91.76   Max.   :500.0  
 NA's   :15     NA's   :1       NA's   :14                     

Data Pre-Processing Summary

  • All columns converted to numeric data type
  • Renamed columns for clearer interpretation during analysis and reporting
  • Combined month, day of month and day of week into one data column

Data Exploration Summary

  • Summary statistics by day of week, month
  • Histograms to visualize distributions of data
  • Correlation coefficients for all features with respect to Ozone levels (output)
  • Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg
  • Strong Negative Correlations: IBH_LAX and Visibility_LAX

Exploration of the Missing Data

  • The total number of missing data points for each column is shown below as a count and as a percentage.
  • Most of the columns contain <5% missing values. Temp_EM contains 139 missing values, which is 38.0% of the observations.
  • The total number of missing values in the dataset is 203 out of 4,758 cells (13 columns times 366 observations). This represents about 4.3% of the entire dataset.
  • There are a few instances where more than one column has missing data, but the majority of rows with missing values are only missing Temp_EM
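The per-column counts and percentages above come from a couple of base-R idioms; a sketch on a small illustrative data frame (not the Ozone data itself):

```r
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, 7, 8),
                 c = c(9, 10, 11, 12))

na_count <- colSums(is.na(df))          # missing values per column
na_pct   <- 100 * colMeans(is.na(df))   # as a percentage of rows

total_pct <- 100 * mean(is.na(df))      # share of all cells that are missing
```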

Statistical Testing for MCAR (Missing Completely at Random)

  • The MissMech package was used to test for MCAR
  • First, check for normality of dependent variable (ozone levels)
  • If normally distributed, use Hawkins’ test
  • If not normally distributed, use a non-parametric test
  • Shapiro-Wilk and Anderson-Darling tests indicate non-normality (p < 0.05)
  • The Q-Q plot and histogram also visibly show a non-normal distribution
  • Non-parametric test p-value: 0.27; we fail to reject the null hypothesis that the data are MCAR
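The MissMech workflow above looks roughly like the following sketch (ozone_numeric is a placeholder for the numeric columns of the dataset; TestMCARNormality reports both Hawkins’ test and a non-parametric alternative):

```r
library(MissMech)  # assumes the MissMech package is installed

# Normality check on the dependent variable first (base-R Shapiro-Wilk)
shapiro.test(ozone_numeric$Ozone_reading)

# Test the MCAR assumption on the columns containing missing data
out <- TestMCARNormality(data = ozone_numeric)
summary(out)
```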

Further Visualization of the Missingness Pattern

Missing Data Deletion or Imputation

  • Listwise deletion (delete rows with missing data)
  • Feature selection (delete columns with >20% missing data) -> Temp_EM column
  • Mean, median, or mode imputation
  • MICE
  • Random Forest
  • A total of 7 datasets were created using various methods for handling missing data

Model-Fitting of Imputed Datasets

  • Seven missing data methods -> Seven datasets
  • Each dataset was split into train/test datasets using a split ratio of 0.75
  • Models were fit for Ozone levels using random forest, decision tree, or KNN algorithms
  • Resulting RMSE for test data was used as a metric for best missing data methodology
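The split-and-score loop can be sketched in base R; here lm() stands in for the random forest, decision tree, and KNN learners to keep the example self-contained (the data, names, and seed are illustrative):

```r
set.seed(42)
df <- data.frame(x = runif(100))
df$y <- 3 * df$x + rnorm(100, sd = 0.2)

# Split into train/test with a 0.75 ratio
idx   <- sample(nrow(df), size = 0.75 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# lm() is a stand-in for randomForest(), rpart(), or a KNN fit
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Test-set RMSE: the metric used to compare missing data methods
rmse <- sqrt(mean((test$y - pred)^2))
```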

Model Performance for Each Missing Data Methodology

  • The RMSE for each model and imputation method combination was summarized into a data frame
  • The best model fit (lowest RMSE) was obtained when using feature selection (column deletion) for missing data and the random forest algorithm for model training
  • Simple methods likely outperformed more advanced methods in this case since statistical testing supports that data are MCAR
  • There is no single best method for handling missing data, it will always depend on the context
  • Testing of multiple methods is the typical approach
          Model  ImpMethod     RMSE
1  RandomForest    DropCol 3.659657
3  DecisionTree       Mean 4.266184
19 RandomForest       Mode 4.367134
16 RandomForest     DropNA 4.493497
4  RandomForest     Median 4.591508
7  RandomForest       MICE 4.901004
10 RandomForest       Mean 5.073677
21 DecisionTree       MICE 5.077593
9  DecisionTree     DropNA 5.213247
12 DecisionTree       Mode 5.249752
13 RandomForest missForest 5.398429
18 DecisionTree     Median 5.537124
2           KNN     DropNA 5.548309
6  DecisionTree missForest 5.799608
8           KNN    DropCol 6.117349
17          KNN       Mean 6.118703
20          KNN missForest 6.142781
5           KNN       Mode 6.161134
11          KNN     Median 6.298989
14          KNN       MICE 6.371706
15 DecisionTree    DropCol 6.603588

Conclusions

  • In analytical processes, managing missing data is essential to preserving data integrity.
  • We investigated several approaches for handling missing data, including both deletion and imputation methods.
  • Imputation techniques included mean, median, mode, MICE, and missForest
  • Deletion methods included listwise deletion (row deletion) and feature selection (column deletion)
  • Understanding the missing data mechanism (MCAR, MAR, MNAR) is necessary for choosing an appropriate imputation technique.

Conclusions Continued

  • The efficacy of each technique was evaluated on the Ozone dataset using metrics such as RMSE and R-squared.
  • The characteristics of the dataset and the goals of the analysis should guide the choice of imputation technique.
  • Deletion procedures can produce biased conclusions when the missing data are not missing completely at random.
  • While single imputation techniques are straightforward, they may understate data variability.
  • Future work could explore more sophisticated machine learning methods to improve the accuracy of missing data imputation.